pp_pack.c: Convert from using 'to_uvchr' functions to 'to_uv' #23564

khwilliamson · 2025-08-12T01:31:53Z

This series of commits converts the calls that convert from UTF-8 to code points from using the utf8n_to_uvchr() forms to instead use the newer utf8_to_uv_foo() forms.

One reason to do this is that the newer forms have a much easier interface for returning to the caller if the input was malformed. The old interface was arcane, and this file did not use it correctly. Since at least 5.8.9, just one of the calls in this file knew that the string was malformed. Everywhere else, it thought things were hunky dory when they were not.

That meant that the code to cope with malformedness was dead; it never got executed. This p.r. rectifies that. However, it turns out that t/utf8_decode.t was relying on one of the code paths being dead. I had to remove the check when it came alive.

So I took care to make each commit in this series change only one call; to aid in bisecting if field experience indicates other calls should not check for validity.

This set of changes requires a perldelta entry

Change a few lines that affect the next commits that use a coding style not conforming to current expectations And also add some blank lines to serve as "paragraph" markers between code segments, and change the indentation of a few lines in preparation for blocks being added and removed in the next commits.

This converts a call to utf8n_to_uvchr() to instead use utf8_to_uv_flags(). This is part of the process of replacing all the '_uvchr' functions with '_uv' functions, which are now preferred. See perlapi. There is a subtle change of behavior here when the string to convert is malformed. I believe this is a bug fix. The macro NEXT_UNI_VAL assumed the failure return of utf8n_to_uvchr() is 'retlen' being -1. However, that value is only ever returned if the flags passed to the function include UTF8_CHECK_ONLY. Inspection of the code showed that this macro is never called with that flag; so this never currently fails. I looked at Perl v5.8.9, and the situation was the same. I doubt it has changed until now. This commit changes to use utf8_to_uv_flags(), and its failure return is properly checked for. The upshot is that this could now fail; whereas it couldn't before, even though the clear intent of the code has been to fail upon error.

Like the previous commit, this converts a call to utf8n_to_uvchr() to instead use utf8_to_uv_flags(). And like that commit, the previous code thought it was checking for failure and croaking when it occurred, but was checking the wrong way so that it never could fail. And now it does a proper check and can fail.

The new form croaks under failure; the previous didn't. But to get here the previous paragraph of code had to succeed, which means this should too. The flag in the changed call is the default for utf8_to_uv_or_die(), so it can be removed.

This converts a call to utf8n_to_uvchr(), using instead utf8_to_uv(). The flag parameter UTF8_ALLOW_DEFAULT before this commit is the default for utf8_to_uv(), so it can be omitted. Like previous commits, the previous code thought it was checking for failure and croaking, but was checking the wrong way so that it never could fail. In the new commit, I had to remove that checking, because it caused test suite failures.

Like previous commits, this converts a call to utf8n_to_uvchr() to use instead utf8_to_uv_flags(). And like previous commits, the previous code thought it was checking for failure and croaking, but was checking the wrong way so that it never could fail. With the new commit, it can fail. Also, the previous commit stored the result in an 'int', whose range can be significantly less than the code point returned. The value was truncated without notice. Now, I croak if the value got truncated. I wonder what I should do, as this is for unpacking a 'c' format, which is supposed to be a signed 8-bit value.

Like previous commits, this converts a call to utf8n_to_uvchr() to instead use utf8_to_uv_flags(). Unlike those commits, the previous code did in fact properly check for failure. But upon failure, it semi-unsafely assumed the start byte was accurate as to the intended byte length of the input character. (It did check that the returned value did not go past the end of the buffer, by using UTF8_SAFE_SKIP() ). The new function call always returns the correct number of bytes, so just use that instead of trying to recompute it. The mechanism for deteriming failure no longer depends on the CHECK_ONLY flag, so that is removed.

Prior to this commit, the code went through the buffer with warnings turned off, but saving the fact if there were problems. If so, it ran through the buffer again, with warnings turned on to display those problems. I'm unsure of why it didn't output any warnings until it got to the end. But utf8_to_uv_msgs() allows us to store up the messages and output them later, without having to reparse

The reason for it to be more than this was removed in the previous commit

khwilliamson added 9 commits August 11, 2025 18:35

pp_pack: unpack: Use 'uv' form, not 'uvchr'

f18e772

The new form croaks under failure; the previous didn't. But to get here the previous paragraph of code had to succeed, which means this should too. The flag in the changed call is the default for utf8_to_uv_or_die(), so it can be removed.

pp_pack.c: Convert variable to bool

6556ff3

The reason for it to be more than this was removed in the previous commit

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

pp_pack.c: Convert from using 'to_uvchr' functions to 'to_uv' #23564

pp_pack.c: Convert from using 'to_uvchr' functions to 'to_uv' #23564

Uh oh!

khwilliamson commented Aug 12, 2025

Uh oh!

Uh oh!

pp_pack.c: Convert from using 'to_uvchr' functions to 'to_uv' #23564

Are you sure you want to change the base?

pp_pack.c: Convert from using 'to_uvchr' functions to 'to_uv' #23564

Uh oh!

Conversation

khwilliamson commented Aug 12, 2025

Uh oh!

Uh oh!